The main libraries we use are GGally, to get correlations between many variables at once; ggplot2, to plot graphs; and dplyr, for other exploratory data analysis functions.
Let’s have a look at a small chunk of the wine dataset.
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## [1] 1599
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200    Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900    1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200    Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278    Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400    3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800    Max.   :1.000
##  residual.sugar     chlorides        free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density            pH           sulphates
##  Min.   :  6.00        Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00        1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00        Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47        Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00        3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00        Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol      quality
##  Min.   : 8.40   3: 10
##  1st Qu.: 9.50   4: 53
##  Median :10.20   5:681
##  Mean   :10.42   6:638
##  3rd Qu.:11.10   7:199
##  Max.   :14.90   8: 18
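The output above has the shape that `head()`, `nrow()` and `summary()` produce in R. As a minimal sketch of those calls, here is a small data frame typed in by hand from the six rows printed above. The name `wine_head` and the choice of columns are assumptions for illustration; the full analysis would read all 1599 rows from the original data file instead.

```r
# Six rows copied by hand from the printed output above (subset of columns).
# `wine_head` is a hypothetical name; the real data frame has 1599 rows.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

head(wine_head)      # first rows of the data frame
nrow(wine_head)      # number of observations
summary(wine_head)   # min, quartiles, mean and max for each column
```

On six rows the summary is of course not representative; the point is only to show which calls give which part of the output.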
We find that most wines have a fixed acidity level between 6 and 10, and that it follows a roughly normal distribution.
Most wines contain a narrow range of residual sugar, about 2 +/- 1.
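A claim like "about 2 +/- 1" can be checked by computing the share of values that fall inside the range. The sketch below uses a hypothetical helper, `share_within()`, and only the six residual sugar values printed at the top of this report, so it illustrates the check rather than reproducing the full result.

```r
# Hypothetical helper: fraction of non-missing values inside [lower, upper].
share_within <- function(x, lower, upper) {
  mean(x >= lower & x <= upper, na.rm = TRUE)
}

# Residual sugar for the six rows printed earlier in this report.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8)

share_within(residual_sugar, lower = 1, upper = 3)   # all six values fall in 1..3
```

Run on the full residual.sugar column, the same call would give the actual proportion of wines in that range.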
The chloride level in most wines is a little under 0.1 (median 0.079, mean 0.087), and it has very low variability.
Let’s have a clearer look at the histogram.
Most wines have a free sulfur dioxide level under about 20, but many go beyond that value.
The density of wine follows a roughly normal distribution, with values between 0.990 and about 1.004.
Wine is a fairly acidic drink, with a pH mostly between 3 and 3.6.
Most wines are rated 5 or 6. Good wines are rated 7, and a few rare great wines are rated 8.
The plot shows that the alcohol percentage is usually between 9 and 12. Few wines have a higher level.
Boxplot of alcohol
The amount of sulphate used is usually between 0.4 and 0.8.
After removing the outliers, we find that the amount used in most wines is between 0.4 and 1.
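The report does not say which outlier rule was used. One common choice, and the one a boxplot draws, is Tukey's rule: points more than 1.5 times the interquartile range beyond the hinges. The sketch below applies it with base R's `boxplot.stats()` to a short vector of made-up sulphate values. The numbers are illustrative only and are not taken from the dataset.

```r
# Illustrative values only (NOT from the wine data): eight typical levels
# and one extreme value.
sulphates <- c(0.45, 0.50, 0.55, 0.60, 0.62, 0.65, 0.70, 0.75, 2.00)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges.
outliers <- boxplot.stats(sulphates)$out

# Keep everything that was not flagged.
trimmed <- sulphates[!sulphates %in% outliers]

outliers         # the flagged value(s)
range(trimmed)   # range of the remaining values
```

With the real sulphates column, `range(trimmed)` gives the kind of "most wines" interval quoted above.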
Boxplot of Sulphates
The plot shows that most wines have a citric acid level of 0.5 or below.
Boxplot of Citric Acid
Histogram
Boxplot of Volatile Acid
There are 1599 red wines in the dataset, with the features fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and finally the main one, quality.
Quality is blind tested by experts and rated from 1-10. Because these scores are ranked categories rather than measurements, the data type of quality had to be changed to ordinal.
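In R, an ordinal variable is an ordered factor. As a minimal sketch of that conversion, the code below rebuilds the quality column from the counts shown in the summary output (10, 53, 681, 638, 199 and 18 wines for scores 3 to 8) and turns it into an ordered factor. The variable names are assumptions; the original code is not shown in this report.

```r
# Counts per quality score, copied from the summary output above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild a numeric quality vector with one entry per wine.
quality_num <- rep(as.integer(names(counts)), times = counts)

# Convert to an ordered factor so that R treats 3 < 4 < ... < 8 as ranks.
quality <- factor(quality_num, levels = 3:8, ordered = TRUE)

is.ordered(quality)                  # TRUE
table(quality)                       # same counts as in the summary
mean(quality %in% c("5", "6"))       # share of wines rated 5 or 6
```

The last line shows that roughly 82% of the wines are rated 5 or 6, which matches the observation that most wines sit in the middle of the scale.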
Most wine is rated between 5 and 7 and has an alcohol level between 9.5 and 11.1. The plot shows that most quality scores are 5 or 6, and many wines also score 7.
To get a wider idea about the entire data, the graph below summarises every field against every other field. We investigate further based on this result, looking for features that have a good normal distribution and features that have a high correlation.
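The numeric part of such an all-against-all summary is a correlation matrix. As a rough sketch, base R's `cor()` computes it for every pair of columns. The data frame below holds only the six rows printed at the top of this report (the name `wine_head` is an assumption), so the coefficients are not the ones from the full analysis.

```r
# Six rows copied by hand from the printed output; NOT the full dataset.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  density          = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlation of every column against every other column.
corr <- round(cor(wine_head), 2)

corr                # full matrix
corr["quality", ]   # correlations of each feature with quality
```

The graphical version can be drawn with `pairs()` in base R or with `ggpairs()` from GGally, which adds the correlation coefficients and density plots to the same grid.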
Focusing on drinks with a high amount of sulphate
Sulphate and alcohol show almost no correlation. They therefore appear to be two largely independent factors that affect the quality of the wine.
Higher citric acid and lower volatile acid
Another interesting observation that doesn’t make much sense is that volatile acidity has a positive correlation with pH. Chemically, pH rises as acidity falls, so we would expect a negative correlation. Yet according to the data, pH actually increases with volatile acidity. We plot this relationship and find a lot of outliers, which might be causing the
We can see a clear correlation between quality and alcohol level.
There is a clear correlation between sulphate level and quality. However, we further observe that some drinks contain a high amount of sulphates, yet their quality is usually rated average, and the best drinks don’t push the level of sulphates. Let’s subset that particular data and find how much alcohol they have. We find that although the sulphate level is high, the alcohol level is low in these average-quality wines.
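Subsetting the high-sulphate wines and comparing their alcohol level with the rest can be sketched as follows. Both the data and the cut-off of 1.0 are assumptions for illustration: the rows are made up and the report does not state which threshold was used.

```r
# Made-up rows for illustration only (NOT from the wine data).
wine <- data.frame(
  sulphates = c(0.55, 0.62, 1.10, 1.30, 0.75, 1.20, 0.58, 0.90),
  alcohol   = c(9.5, 10.2, 9.4, 9.6, 12.0, 9.8, 11.5, 12.5),
  quality   = c(5, 6, 5, 5, 7, 6, 6, 8)
)

# Hypothetical threshold for "high sulphate".
high_cutoff <- 1.0

high <- subset(wine, sulphates > high_cutoff)
rest <- subset(wine, sulphates <= high_cutoff)

mean(high$alcohol)    # average alcohol among high-sulphate wines
mean(rest$alcohol)    # average alcohol among the others
table(high$quality)   # how the high-sulphate wines were rated
```

The same comparison could be written with dplyr's `filter()` and `summarise()`; base R's `subset()` is used here only to keep the sketch free of package dependencies.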
Volatile acidity refers to the steam-distillable acids present in wine, primarily acetic acid but also lactic, formic, butyric, and propionic acids. The lower the volatile acidity, the better the wine quality.
Density has a negative correlation with alcohol: it tends to decrease as alcohol content increases.
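The direction of this relationship can be read from the sign of the correlation coefficient or from the slope of a simple linear fit. The sketch below uses only the six alcohol and density values printed at the top of this report, so it shows the calls and the sign rather than the strength of the relationship in the full data.

```r
# Alcohol and density for the six rows printed earlier in this report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
density <- c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978)

# Pearson correlation: a negative value means density falls as alcohol rises.
r <- cor(alcohol, density)

# Simple linear regression of density on alcohol; the slope has the same sign.
fit <- lm(density ~ alcohol)
slope <- coef(fit)[["alcohol"]]

r
slope
```

A negative slope is what we would expect physically, since ethanol is less dense than water.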
Sulphate marginally improves the quality of the drink, according to the graph below. But there are a lot of high-sulphate drinks that are merely average and not exceptional. We investigate this in the next visualisation.
It plays an important role in preventing oxidization and maintaining a wine’s freshness.
We put both together and look for a pattern. Better-quality drinks appear in a lighter shade of blue. We notice that the region of low volatile acidity and high citric acid contains the better-quality drinks.
It is difficult to say whether the experts prefer the higher alcohol content or the lower density, because higher alcohol is correlated with lower density. According to the correlations, high alcohol and low density should give a better-quality wine. But the two are related in the first place: we find that density decreases with increasing alcohol. This suggests that buying high-alcohol drinks means buying low-density drinks, as the two are slightly related. Among the wines that scored 6 and 7, you notice that many have a sulphate level higher than 0.8. Either way, higher alcohol or lower density is correlated with better-quality drinks.
We already know from the bivariate analysis that sulphate and alcohol aren’t related the way alcohol and density are. So it is a clear indication that high levels of both sulphates and alcohol improve the quality of the drink.
Drinks with a high amount of sulphate are investigated along with alcohol. We find that a high amount of sulphate helps make the drink good, and alcohol makes it even better. But a drink without much sulphate can still get away with a high alcohol level.
Higher citric acid helps improve the quality of the drink significantly. And from the univariate analysis we find that most wines have a low citric acid level.
We see a clear pattern: higher-quality drinks mostly lie in the region of higher sulphate and alcohol levels.
We can observe that almost every drink with an alcohol level above 11% has a rating of 6 or above, which is at least average. We also notice that the proportion of drinks rated 8 increases gradually at alcohol levels of 12% and above.
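One way to put numbers on this observation is to cut alcohol into bands and compute the share of each rating within every band. The sketch below does that with `cut()`, `table()` and `prop.table()`. The ten rows are made up for illustration and the band edges (11% and 12%) are taken from the text above, so the resulting shares are not the real ones.

```r
# Made-up rows for illustration only (NOT from the wine data).
alcohol <- c(9.2, 9.8, 10.4, 10.9, 11.3, 11.8, 12.2, 12.9, 13.4, 9.5)
quality <- c(5, 5, 5, 6, 6, 7, 7, 8, 8, 5)

# Bands are right-closed: (8, 11], (11, 12], (12, 15].
band <- cut(alcohol, breaks = c(8, 11, 12, 15),
            labels = c("<=11", "11-12", ">12"))

# Share of each rating within each alcohol band (each row sums to 1).
shares <- prop.table(table(band, quality), margin = 1)

# Share of wines rated 6 or above in each band.
share_good <- tapply(quality >= 6, band, mean)

shares
share_good
```

With the full dataset, reading `shares` row by row shows how the mix of ratings shifts as the alcohol level goes up.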
At first, it seemed like there was hardly any correlation. The data didn’t make much sense. But on further investigation, subtle patterns and information were retrieved, which helped build an idea about which chemicals might help improve the quality of wine.
There is not much of a correlation between wine quality and its chemicals. According to this analysis, there is no silver-bullet formula for making a great wine with chemicals alone. But we can consider three elements: alcohol, citric acid and sulphate each help make the drink better.
At first, it seemed like there was hardly any correlation, and the data didn’t make much sense. But checking the bivariate correlations gave a better sense of the data. A combination of two or more features seemed to affect the quality of the wine.
More often than not, I got stuck trying to implement a good visualisation using the colour palette. The syntax was also pretty unusual for me, as it was my first time with R. But finding the right examples and resources elsewhere helped me overcome these problems.
Finding other interesting patterns also helped in building a better plot. One important question that bothered me was how pH increased with an increase in volatile acidity, when it was supposed to decrease. I have my own theories, but they are only opinions; I don’t have any credible information to back them. It would be worth investigating in the future.